In this writeup I examine the following:

  1. How does sample size vary across datasets?

  2. How does the “exact” version of SCEPTRE (i.e., the version of SCEPTRE that returns an “exact” p-value based on B = 300,000 resamples) perform on the Frangieh IFN-gamma, Papalexi gene, and Papalexi protein data?

A question that I seek to answer is whether SCEPTRE exhibits CAMP, or “Confounder Adjustment via Marginal Permutations” (previously known as implicit confounder adjustment).

Sample size

Sample size has emerged as an important quantity in low MOI (and even in high MOI). Here, I study issues related to sample size in a more precise and in-depth way than I have previously.

Definitions of relevant quantities

I begin by defining several quantities – n_nonzero_treatment, n_nonzero_control, and effective_sample_size – on the negative control data. Let a dataset be given. Let \(d\) be the number of NTCs, let \(p\) be the number of genes, and let \(n\) be the number of cells in the dataset. Let \(X \in \mathbb{R}^{n \times p}\) be the cell-by-gene expression matrix. Next, for \(k \in \{1, …, d\},\) let \(n_k\) be the number of cells containing NTC \(k\) (where the NTCs are arbitrarily indexed). Also, let \(X^{1} \in \mathbb{R}^{n_1 \times p}\) be the submatrix of \(X\) consisting of the cells that received NT 1 (and similarly for \(X^2, \dots, X^d\)). For a given NTC \(k\) and gene \(j\), let $$\texttt{n_nonzero_treatment}_{kj}$$ be the number of cells containing NTC \(k\) in which gene \(j\) has nonzero expression, i.e.

\[ \texttt{n_nonzero_treatment}_{jk} = \sum_{i=1}^{n_k} \mathbb{I}(X^k_{i,j} \geq 1).\]

Next, let \(\texttt{n_nonzero_control}_{kj}\) be the number of NT-containg cells excluding NTC \(k\) (i.e., the cells containing NTCs in the set \([d] \setminus \{k\}\)) with nonzero expression, i.e.,

\[ \texttt{n_nonzero_control}_{jk} = \sum_{k' \neq k} \texttt{n_nonzero_treatment}_{k'j} \]

Finally, let effective_sample_size be the sum of n_nonzero_treatment and n_nonzero_control. On the negative control data, effective_sample_size is a function of the gene only:

\[ \texttt{effective_sample_size}_j = \sum_{k = 1}^d \texttt{n_nonzero_treatment}_{jk} \]

I study how effective_sample_size, n_nonzero_treatment, and n_nonzero_control relate to one another.

Empirical relationship between effective_sample_size, n_nonzero_treatment, and n_nonzero_control

First, I plot n_nonzero_treatment against effective_sample_size, faceting by dataset. In other words, I plot

\[ \{ (\texttt{n_nonzero_treatment}_{jk}, \texttt{effective_sample_size}_{j})\}_{j,k} \]

for each dataset.

Clearly, n_nonzero_treatment and effective_sample_size are correlated. This makes sense: for a given gene, if expression is high in cells with a given NTC, then expression likely is high in cells with the remaining NTCs. (Put differently, expression levels are correlated across NTCs: highly expressed genes tend to be highly expressed across NTCs, and similarly for lowly expressed genes.)

Next, I plot n_nonzero_control against effective_sample_size, i.e.

$$\{(\texttt{n_nonzero_control}_{jk}, \texttt{effective_sample_size}_{j})\}_{j,k}$$

Clearly, n_nonzero_control and effective_sample_size are basically the same variable. Again, this makes sense: for a given gRNA, n_nonzero_control is equal to effective_sample_size minus the number of cells with nonzero expression containing that gRNA. Finally, I plot n_nonzero_treatment against n_nonzero_control.

Again, these variables are correlated (but certainly not perfectly) for the same reason that n_nonzero_treatment and effective_sample_size are correlated.

I conclude on the basis of this analysis that, while effective_sample_size is a useful single-number summary of sample size, it is meaningful to look at both at n_nonzero_treatment and n_nonzero_control when examining sample size-related issues.

Comparison across datasets

Here, I make some comparisons related to sample size across datasets. First, I make a barplot of the number of NTCs per dataset.

The Frangieh data contain the greatest number of NTCs, followed by the Schraivogel data and then the Papalexi data. Next, I plot the average number of nonzero cells per NTC (i.e., n_nonzero_treatment) across datasets.

I print the median n_nonzero_treatment value for each dataset below.

sample_size_df_nt |>
  group_by(dataset) |>
  summarize(m = median(n_nonzero_treatment))
## # A tibble: 7 × 2
##   dataset                                      m
##   <fct>                                    <dbl>
## 1 frangieh_co_culture_gene                   9  
## 2 frangieh_control_gene                      6  
## 3 frangieh_ifn_gamma_gene                    7  
## 4 papalexi_eccite_screen_gene               29  
## 5 schraivogel_enhancer_screen               52.5
## 6 schraivogel_ground_truth_perturbseq_gene  37  
## 7 schraivogel_ground_truth_tapseq_gene     165

The Frangieh data have the smallest number nonzero cells per NTC (median under 10). The Papalexi data are indermediate (median about 30), and the Schraivogel data have the greatest number of cells per NTC (median 37 on perturb-seq data; even greater median on other datasets). Finally, I plot the effective_sample_size per dataset.

I print the per-dataset median effective sample size below.

sample_size_df_nt |>
  group_by(dataset) |>
  summarize(m = median(effective_samp_size))
## # A tibble: 7 × 2
##   dataset                                      m
##   <fct>                                    <dbl>
## 1 frangieh_co_culture_gene                   712
## 2 frangieh_control_gene                      436
## 3 frangieh_ifn_gamma_gene                    581
## 4 papalexi_eccite_screen_gene                315
## 5 schraivogel_enhancer_screen               1764
## 6 schraivogel_ground_truth_perturbseq_gene  1186
## 7 schraivogel_ground_truth_tapseq_gene      5775

We see that the effective sample sizes are in a similar range on the Frangieh and Papalexi datasets (while the effective sample sizes on the Schraivogel datasets are larger). The Frangieh data have fewer nonzero cells per NTC but many more NTCs than the Papalexi data, causing the effective sample size to be roughly equal across the Frangieh and Papalexi data.

Now, I examine the number of nonzero cells per targeting gRNA (grouped and ungrouped) across datasets.

sample_size_df_t |>
  group_by(dataset) |>
  summarize(median_nonzero = median(n_nonzero_cells))
## # A tibble: 7 × 2
##   dataset                                  median_nonzero
##   <fct>                                             <dbl>
## 1 frangieh_co_culture_gene                              5
## 2 frangieh_control_gene                                 3
## 3 frangieh_ifn_gamma_gene                               4
## 4 papalexi_eccite_screen_gene                          17
## 5 schraivogel_enhancer_screen                          14
## 6 schraivogel_ground_truth_perturbseq_gene             26
## 7 schraivogel_ground_truth_tapseq_gene                117

Next, I plot the number of nonzero cells per gRNA group:

grouped_ss_df <- sample_size_df_t |>
  group_by(dataset, grna_group, response_id) |>
  summarize(s = sum(n_nonzero_cells))
## `summarise()` has grouped output by 'dataset', 'grna_group'. You can override
## using the `.groups` argument.
grouped_ss_df$dataset_rename <- stringr::str_to_title(gsub(pattern = "_",
                                                            replacement = " ",
                                                            x = grouped_ss_df$dataset))
paper <- factor(x = grouped_ss_df$dataset,
       levels = c("frangieh_co_culture_gene",
                  "frangieh_control_gene",
                  "frangieh_ifn_gamma_gene",
                  "papalexi_eccite_screen_gene",
                  "schraivogel_ground_truth_perturbseq_gene",
                  "schraivogel_enhancer_screen",
                  "schraivogel_ground_truth_tapseq_gene"),
       labels = c(rep("Frangieh", 3), "Papalexi", rep("Schraivogel", 3)))
grouped_ss_df$paper <- paper

grouped_ss_df |>
  group_by(dataset) |>
  sample_n(10000, replace = TRUE) |>
  ggplot(mapping = aes(x = dataset_rename,
                       y = s + 1,
                       fill = paper)) +
  geom_violin() +
  geom_boxplot() +
  scale_y_log10() +
  ylab("N nonzero cells per targeting gRNA (ungrouped)") +
  xlab("") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

grouped_ss_df 
## # A tibble: 11,769,880 × 6
## # Groups:   dataset, grna_group [2,607]
##    dataset                  grna_group response_id     s dataset_rename    paper
##    <fct>                    <fct>      <fct>       <int> <chr>             <fct>
##  1 frangieh_co_culture_gene A2M        A1BG           43 Frangieh Co Cult… Fran…
##  2 frangieh_co_culture_gene A2M        A1BG-AS1       15 Frangieh Co Cult… Fran…
##  3 frangieh_co_culture_gene A2M        A4GALT          2 Frangieh Co Cult… Fran…
##  4 frangieh_co_culture_gene A2M        AAAS           67 Frangieh Co Cult… Fran…
##  5 frangieh_co_culture_gene A2M        AACS           20 Frangieh Co Cult… Fran…
##  6 frangieh_co_culture_gene A2M        AADAT          13 Frangieh Co Cult… Fran…
##  7 frangieh_co_culture_gene A2M        AAED1          36 Frangieh Co Cult… Fran…
##  8 frangieh_co_culture_gene A2M        AAGAB          84 Frangieh Co Cult… Fran…
##  9 frangieh_co_culture_gene A2M        AAK1           84 Frangieh Co Cult… Fran…
## 10 frangieh_co_culture_gene A2M        AAMDC          56 Frangieh Co Cult… Fran…
## # … with 11,769,870 more rows

Number of hypotheses

First, I plot the number of genes per dataset (after applying mild QC of filtering out genes expressed in fewer than 0.005 of cells).

All datasets have roughly the same number of genes (about 15,000) except for the TAP-seq and enhancer screen datasets (which of course have fewer genes). Next, I plot the number of pairs across datasets.

The Frangieh data of course contain more pairs because they contain more NT gRNAs.

Exact SCEPTRE

I applied the exact version of SCEPTRE (with covariates and using B = 300,000 permutations) to the Papalexi data and Frangieh data.

First, I plot the SCEPTRE exact results on the Frangieh data, comparing to the skew-normal version of SCEPTRE (with covariates) and NB regression.

SCEPTRE (exact) exhibits good calibration, enabling us to conclude that a permutation test works on the Frangieh data. The other methods (NB regression and SCEPTRE (SN)) are not as well calibrated, likely because asymptotics have not yet kicked in. Next, I plot the results on the Papalexi gene expression data, first on an untransformed scale.

Clearly, in the bulk of the distribution, SCEPTRE (exact) is doing the best, followed by NB regression and then by SCEPTRE (SN). I now plot the same results on a transformed scale.

We see that on the transformed scale, SCEPTRE (exact) performs the worst (beyond about 0.001), while NB regression is the best! What is going on here? In particular, why is SCEPTRE (exact) performing worse than SCEPTRE (SN) in the tail? Below, I plot the SCEPTRE (exact) p-values against the SCEPTRE (SN) p-values on a transformed scale.

The SCEPTRE (SN) p-values are in broad agreement with the SCEPTRE (exact) p-values; however, the scatterplot has an obvious “appendage” below the diagonal. The pairs in this “appendage” have much smaller SCEPTRE (exact) p-values than SCEPTRE (SN) p-values. Why does this appendage appear? And are the pairs in this appendage “qualitatively different” (in some way) than the other pairs? I plot the points in the appendage below and pick a few to investigate.